Abstract

There has been significant progress in understanding the process of protein translation in recent years. 
One of the best examples is the discovery of usage bias in successive synonymous codons and its role in 
eukaryotic translation efficiency. We observed here a similar type of bias in the other two life domains, 
bacteria and archaea, although the bias strength was much smaller than in eukaryotes. Among 136 pro- 
karyotic genomes, 98 were found to have significant bias from random use of successive synonymous 
codons with Z scores larger than three. Furthermore, significantly different bias strengths were found 
between prokaryotes grouped by various genomic or biochemical characteristics. Interestingly, the bias 
strength measured by a general Z score could be fitted well (R = 0.83, P < 10^{-15}) by three genomic variables:
genome size, G + C content, and tRNA gene number based on multiple linear regression. A different
distribution of synonymous codon pairs between protein-coding genes and intergenic sequences suggests 
that bias is caused by translation selection. The present results indicate that protein translation is tuned 
by codon (pair) usage, and the intensity of the regulation is associated with genome size, tRNA gene 
number, and G + C content. 

Key words: successive synonymous codon bias; comparative analysis; prokaryotic genomes; Z scores; 
translation efficiency 



1. Introduction 

Recent studies show that protein-coding genes use codon patterns to fine-tune translation and increase
protein synthesis efficiency. 1-7 Three types of codon usage have been proposed to influence translation.
First, use of single codons may influence the speed and accuracy of translation. 3,5-11 Frequent use of
'preferred codons' is believed to maximize translation efficiency. 12 This hypothesis takes previous
observations in Escherichia coli and several other unicellular microbes as supporting evidence. 13 In these
small organisms, highly expressed genes are found to have more extreme codon bias, where codon bias denotes
unequal usage of the up to six synonymous codons encoding the same amino acid, with 'preferred codons'
in highly expressed genes corresponding to the most abundant isoaccepting tRNAs. 12

Second, codon pair usage is associated with transla- 
tion efficiency, wherein a codon pair indicates two 
successive codons. 14-17 The biased use of codon 
pairs is a common phenomenon in a wide range of 
species. 18 The observed codon pair frequency often 
deviates from expected values predicted from two 
single codons. Some codon pairs are overrepresented 
and others underrepresented. A variety of selective 
or non-selective factors are responsible for such 
bias. 19 One such factor is that codon pair usage 
affects translation. 15,20 In fact, peptide bond formation requires simultaneous accommodation of two
codons and of two tRNAs in the ribosomal A and P sites. 21 For spatial reasons, it is thought that not all
codon and tRNA combinations are equally compatible on the ribosome surface. 22 Some combinations of
codon pairs and tRNAs would be advantageous for translation efficiency. 20 Structural features that
regulate tRNA geometry within the ribosome govern codon pair patterns, driving enhanced translational
fidelity and/or rate. 14 Experimental results support such a mechanism. 17,22

Third, an interesting bias of successive synonymous codon pairs was found in eight eukaryotic
genomes. 1,2,4 A synonymous codon pair denotes a codon and the next codon in the same gene that encodes
the same amino acid, regardless of how many codons for other amino acids lie between them, requiring only
that no other synonymous codon occurs between the two. In this study, bias of synonymous codon pairs
denotes the difference between the actual pair frequency and the frequency expected if the two codons were
chosen independently. Cannarozzi et al. 1 found a strong tendency to reuse, at the next occurrence of an
amino acid, the codon used at its previous occurrence. If the same codon is not reused, there is a bias
towards the most closely related synonymous wobble codons. This predisposition towards selecting
particular codons, rather than arbitrarily choosing one from the synonymous set, is termed
'autocorrelation' or 'codon reuse', and has important implications for protein translation. Based on
comprehensive analyses of highly expressed genes, it was suggested that codon reuse may provide an
effective mechanism to speed up translation. 1 Wet-bench experiments demonstrated that translation of
autocorrelated mRNA was substantially (30%) faster than that of anti-correlated mRNA. 1
This result reinforces the speculation that codon reuse could benefit translation efficiency. 1

Due to its intimate relationship with translation 
efficiency, biased use of single codons and codon 
pairs has been studied extensively. However, the study of biased
use of successive synonymous codon pairs is relatively new. 1,2
Cannarozzi and colleagues observed this phenom- 
enon in eukaryotes. 1 Here, we performed cross- 
species analysis of 'codon reuse' in 136 prokaryotes. 
We not only show the existence of 'codon reuse' in 
various prokaryotic genomes and hence illustrate it 
as a universal mechanism among the three domains 
of life, but also compare the level of biases among 
various prokaryotes. Most importantly, we observed that the overall bias intensity for successive
synonymous codon pairs is positively correlated with several genomic factors. Using genomic G + C content,
genome size, and tRNA gene number as explanatory variables, the bias value could be predicted with high
accuracy. Thus, these data reinforce the notion that the genome contains all the information necessary for
regulating protein translation.



2. Materials and methods 

We randomly picked one strain from each sequenced prokaryotic species. Genome sequences and
annotations were downloaded from the NCBI RefSeq project (ftp://ftp.ncbi.nih.gov/genomes/Bacteria) for
136 prokaryotic strains before June 2010. These 136 strains consisted of 109 bacteria and 27
archaea. Information on genome size, G + C content, and tRNA gene number was extracted from the
RefSeq annotation. 23 For all genomes, tRNAs were assigned to the 64 codons according to the extended
wobble rule. 24 Following the wobble rule, we adopted a consistent pattern for assigning codons to
isoaccepting tRNAs for all 136 prokaryotic genomes, similar to that of Cannarozzi et al. 1 In
Supplementary Table S1, the correspondence of codons to tRNAs is illustrated with E. coli as an
example. In fact, there may exist some modifications of the wobble rule. For example, large bacterial
genomes with high G + C% usually have tRNAs with a C-starting anticodon solely responsible for a
G-ending codon, and genes for this tRNA are often present in multiple copies, outnumbering those for the
respective isoaccepting tRNA responsible for both G- and A-ending codons. Another example is the presence
of inosine (I) in Arg tRNAs found in a wide range of bacteria. While A-ending codons are thought to be
recognized by I-containing tRNA on the basis of the extended wobble rule, the efficiency of I-A
recognition is low and there are often (but not always) other tRNAs responsible for A-ending (and
G-ending) codons. However, we do not know in which specific genomes these modifications appear.
Therefore, we did not consider any of the modifications when calculating the bias of successive
synonymous codon pairs.
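
The codon-to-tRNA assignment described above can be pictured as a simple lookup table. The following is a minimal sketch only: the mapping TRNA_OF is a hypothetical toy assignment for alanine written for this illustration (it is not the content of Supplementary Table S1), and the function name is likewise an assumption. Following the usage in this paper, two codons are treated as an isoaccepting pair when they share at least one tRNA.

```python
# Hypothetical toy assignment of alanine codons to tRNA labels.
# Illustration only; this is NOT the assignment given in Supplementary Table S1.
TRNA_OF = {
    "GCT": {"tRNA_Ala_1"},
    "GCC": {"tRNA_Ala_1"},
    "GCA": {"tRNA_Ala_2"},
    "GCG": {"tRNA_Ala_2"},
}


def is_isoaccepting(codon_a, codon_b, trna_of=TRNA_OF):
    """Return True if the two codons are read by at least one common tRNA."""
    return bool(trna_of.get(codon_a, set()) & trna_of.get(codon_b, set()))
```

With such a table, every ordered pair of synonymous codons can be labelled as isoaccepting or non-isoaccepting before the pair statistics are computed.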

For comparison, correlation of synonymous codon 
pairs in eight eukaryotes was also investigated 
(Arabidopsis thaliana, Ashbya gossypii, Caenorhabditis 
elegans, Candida glabrata, Drosophila melanogaster, 
Homo sapiens, Saccharomyces cerevisiae, and 
Schizosaccharomyces pombe). 

We focused only on pairs of consecutive synonymous codons, which may be separated by any number
of codons for other amino acids, in each prokaryotic genome. We used the Z score 25 defined in Equation
(1) to evaluate the difference between the actual and expected numbers of consecutive synonymous
codon pairs and isoaccepting tRNA pairs. Similar to Cannarozzi et al., 1 we counted the number of
synonymous codon pairs for the nine amino acids with at least two tRNAs. The expected number
of each synonymous codon pair was calculated from the product of the frequencies of the two individual
codons in each prokaryote. A negative Z score means that the actual frequency is below the
expected frequency, whereas a positive Z score means that the former is above the latter. 25 The
more positive a score is, the stronger is the translation selection on the considered synonymous codon pair.

\[
Z_{\mathrm{score}} = \frac{\mathrm{Actual\_number} - \mathrm{Expected\_number}}{\mathrm{Standard\_deviation}} \qquad (1)
\]

The standard deviation in Equation (1) is calculated based on the actual and expected numbers of
the collection of all synonymous codon pairs or isoaccepting tRNA pairs. The distributions of codon pairs
in Fig. 1 and of genomes in Fig. 2 were fitted with a Gaussian function. 25 Linear relationships between
various genomic factors and the general Z score were fitted with single-variable or multiple linear
regression. 25 Differences in Z scores between two groups of prokaryotic genomes were statistically
validated by t-tests. 25 All statistical analyses were performed with the freely available R software
(http://www.r-project.org/).
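
The counting and scoring procedure described in this section can be sketched as follows. This is a hedged illustration rather than the implementation used in the study: the function names are hypothetical, genes are assumed to be given as lists of codon strings, and the standard deviation is taken from a binomial model, sqrt(N p (1 - p)), which is an assumption because the text does not spell out how the standard deviation in Equation (1) is obtained.

```python
from collections import Counter
from math import sqrt


def successive_pairs(codons, synonymous):
    """Pairs of successive occurrences of codons from one synonymous set.

    Codons for other amino acids lying in between are simply skipped.
    """
    hits = [c for c in codons if c in synonymous]
    return list(zip(hits, hits[1:]))


def pair_z_scores(genes, synonymous):
    """Z score of every ordered codon pair for one amino acid.

    Assumption: SD is binomial, sqrt(N * p * (1 - p)); not stated in the paper.
    """
    pair_counts = Counter()
    codon_counts = Counter()
    for codons in genes:
        pair_counts.update(successive_pairs(codons, synonymous))
        codon_counts.update(c for c in codons if c in synonymous)
    n_pairs = sum(pair_counts.values())
    n_codons = sum(codon_counts.values())
    scores = {}
    if n_pairs == 0 or n_codons == 0:
        return scores
    for a in sorted(synonymous):
        for b in sorted(synonymous):
            # Expected pair probability if the two codons are chosen independently.
            p = (codon_counts[a] / n_codons) * (codon_counts[b] / n_codons)
            expected = n_pairs * p
            sd = sqrt(n_pairs * p * (1 - p))
            if sd > 0:
                scores[(a, b)] = (pair_counts[(a, b)] - expected) / sd
    return scores


# Toy gene: alanine codons GCT/GCC interrupted by one lysine codon.
toy_scores = pair_z_scores([["GCT", "AAA", "GCT", "GCC", "GCC"]], {"GCT", "GCC"})
```

In this toy gene the pair (GCC, GCT) never occurs, so its score is negative, while the reused pairs (GCT, GCT) and (GCC, GCC) score above zero.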

3. Results 

3.1. Synonymous codon correlation in the E. coli K12
genome

We evaluated 107 bacterial and 29 archaeal
genomes. In the following two sections, E. coli K12 is
taken as an example. We evaluated all pairs of consecutive
synonymous codons in the E. coli K12 genome. Pairs
coding for nine amino acids (alanine, arginine, 
glycine, isoleucine, proline, leucine, serine, threonine, 
and valine), which have at least two tRNAs, were
considered for further analysis. The frequency of all
combinations (e.g. TCCTCT as one combination) was 
then calculated. Assuming a random distribution, 
the expected number of all combinations could be 
estimated from the actual single codon frequencies. 
According to Equation (1), the Z score quantifies the
difference between the actual and expected frequencies of a
combination, in units of standard deviations. 25 We classified each
synonymous codon pair as favoured if its Z score
is larger than 3 s.d., as neutral if between -3 and
+3 s.d., or as disfavoured if less than -3 s.d. The
numbers for three groups of synonymous codon 
pairs for each amino acid are summarized in 
Table 1. Among codon pairs with isoacceptors 
(sharing a tRNA), favoured numbers are all larger 
than neutral or disfavoured ones. However, the oppos- 
ite result is obtained for synonymous codon pairs 
without isoacceptors. These results indicate that the 
reuse of codons sharing the same tRNA is a universal 
phenomenon for the nine E. coli amino acids examined.
This observation is similar to that seen in eukaryotes. 1
It is worth noting that the strength of
synonymous codon correlation in E. coli is much
smaller than that in yeast. For example, the mean Z
score for 10 groups of synonymous codon pairs encoding
the amino acid serine is 8.6469 in yeast 1
and only 5.8762 in E. coli. As mentioned below, Z
scores of synonymous codon pairs in prokaryotes are 
generally lower than those seen in eukaryotes. 
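
The three-way classification used for Table 1 amounts to thresholding each pair's Z score at plus and minus three standard deviations. A minimal sketch, with hypothetical function names and with boundary values (exactly +3 or -3) treated as neutral, might look like this:

```python
from collections import Counter


def classify(z, threshold=3.0):
    """Label a codon pair by its Z score: favoured, neutral, or disfavoured."""
    if z > threshold:
        return "favoured"
    if z < -threshold:
        return "disfavoured"
    return "neutral"


def tally(z_scores, threshold=3.0):
    """Count how many pairs fall into each of the three classes."""
    return Counter(classify(z, threshold) for z in z_scores)


# Made-up Z scores, for illustration only.
example = tally([5.2, 0.4, -3.7, 3.0, 8.1])
```

Tallying separately for isoaccepting and non-isoaccepting pairs of each amino acid gives counts of the kind reported in Table 1.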

3.2. Confirming the hypothesis of translation selection 
One explanation for the codon correlation illustrated above is
selection pressure on codon ordering in protein-coding genes,
whereby synonymous codons sharing the same tRNA are used
successively. 1 However, there also exists a second



Table 1. The numbers of three kinds of synonymous codon pairs for nine amino acids in E. coli

              Isoaccepting                      Non-isoaccepting
Amino acid    Favoured  Neutral  Disfavoured    Favoured  Neutral  Disfavoured
Ala           4         2        2              1         1        6
Arg           6         2        4              11        3        10
Gly           4         2        0              2         1        7
Ile           3         0        2              0         2        2
Leu           5         3        2              6         13       7
Pro           6         0        2              2         2        4
Ser           8         2        0              1         11       14
Thr           4         2        0              1         3        6
Val           4         2        0              2         4        4
Total         44        15       12             26        40       60

Codon pairs are grouped into those with and without isoacceptors (sharing a tRNA), by parsimony. Within each group, pairs
were classified as favoured (> 3 s.d.), neutral (between -3 and +3 s.d.), or disfavoured (< -3 s.d.).
Figure 1. Z score histograms for two groups (isoaccepting and non-isoaccepting) of codon pairs in three
types of sequences in E. coli K12. (A) Z score histograms for the two groups of codon pairs in gene
sequences. The means of the two distributions are different with a P-value = 2.2e-14. (B) Z score
histograms for the two groups of codon pairs in sequences generated by random shuffling. The means of
the two distributions are different with a P-value = 3.3e-11. (C) Z score histograms for the two groups
of codon pairs in intergenic sequences. The means of the two distributions are different with a
P-value = 1.0e-3. The difference between the two types of codon pairs for intergenic sequences is not
only much smaller than that for the gene sequences but also smaller than that for the shuffled gene
sequences. Therefore, the pattern of codon reuse is present in protein-coding sequences, and the
conserved pattern appears to be rooted in translation selection.

explanation. 1 Different genes may be enriched in different codons, and the correlation observed at the
genomic level may be due to the accumulation of given codons in specific genes. If the second
explanation holds, the synonymous codon correlation should remain when codons are shuffled within each
gene individually. 1 If the first holds, such codon shuffling would reduce the difference in codon
correlation between isoacceptor pairs and non-isoacceptor pairs. We performed shuffle experiments in
E. coli K12 to test which hypothesis is correct. Results are shown in Fig. 1, where the vertical axis
denotes synonymous codon pair frequency in the corresponding range of Z scores. Similar to Cannarozzi
et al., 1 codon correlations were found to decrease for isoaccepting pairs in the shuffled genes and
increase for non-isoaccepting pairs. The difference between the isoaccepting pairs and non-isoaccepting
pairs without shuffle (Fig. 1A, t-test, P = 2.2 × 10^{-14}) is larger than the difference after shuffle
(Fig. 1B, t-test, P = 3.3 × 10^{-11}). Thus, autocorrelation was not simply due to codon bias at the
gene level, but due to codon ordering within genes. To reinforce the hypothesis of translation
selection, we also analysed synonymous codon correlation in non-coding regions (i.e. intergenic
sequences). The Z score histograms for isoaccepting and non-isoaccepting pairs are shown in Fig. 1C
(t-test, P = 1.0 × 10^{-3}). The difference between the two types of codon pairs for intergenic
sequences is not only much smaller than that for the gene sequences but also smaller than that
for the shuffled gene sequences. Based on this analysis and that by Cannarozzi et al., 1 the pattern of
codon reuse exists in protein-coding sequences and reinforces the hypothesis that the conserved pattern
is rooted in translation selection.

Figure 2. Histogram of the general Z scores among 136 prokaryotic genomes (horizontal axis: standard
deviations from expected).
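
The within-gene shuffle used in Section 3.2 to separate the two hypotheses can be sketched as below. This is an illustration under stated assumptions, not the procedure actually used: it assumes that codons are permuted only among positions encoding the same amino acid, so that the protein sequence and the per-gene codon counts are both preserved (the text says only that codons are shuffled within each gene). The function name and the small codon table are hypothetical helpers for the example.

```python
import random
from collections import defaultdict

# Small subset of the standard genetic code, enough for the toy example.
AA_OF = {"GCT": "A", "GCC": "A", "AAA": "K", "AAG": "K"}


def shuffle_synonymous(codons, aa_of, rng):
    """Permute codons among positions that encode the same amino acid.

    Assumption: this protein-preserving shuffle stands in for the paper's
    'shuffle within each gene'; per-gene codon usage is left unchanged.
    """
    positions = defaultdict(list)
    for i, codon in enumerate(codons):
        positions[aa_of[codon]].append(i)
    shuffled = list(codons)
    for idxs in positions.values():
        pool = [codons[i] for i in idxs]
        rng.shuffle(pool)
        for i, codon in zip(idxs, pool):
            shuffled[i] = codon
    return shuffled


gene = ["GCT", "AAA", "GCC", "AAG", "GCT"]
shuffled_gene = shuffle_synonymous(gene, AA_OF, random.Random(0))
```

Because the shuffle keeps each gene's codon composition but destroys codon order, any autocorrelation that survives it reflects gene-level codon bias, whereas the part that disappears reflects codon ordering within genes.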

3.3. Varied strengths of codon reuse in prokaryotic 
genomes 

We investigated synonymous codon correlation in 107 bacteria and 29 archaea. The taxonomic distribution
of these genomes is summarized in Table 2. As can be seen, the 107 bacteria are widely distributed
across 14 phyla, and the 29 archaea are distributed across 5 phyla. Therefore, most known prokaryotic
phyla have representatives in our dataset. To compare the strength of synonymous codon correlation
among different genomes, we calculated the general Z score for each genome. This value is equal to the
Z score averaged over the nine amino acids, where, for each amino acid, the Z score is averaged over all
isoaccepting tRNA pairs. First, we compared the general Z scores in the three domains of life. Among
the eight eukaryotes, the mean value of the general Z scores is 15.0. However, the value is as small as
5.6 in bacteria and 3.2 in archaea. This indicates that the strength of codon reuse is much smaller in
prokaryotes than in eukaryotes, although the strength in the former is also significant. The 136
prokaryotic genomes, collectively, have a mean Z score of 5.1 and a standard deviation of 3.1.
Streptomyces coelicolor A3 has the largest Z score of 15.1. Note that this genome has the highest
G + C content and is one of the largest of the 136 prokaryotic genomes. In contrast, Nanoarchaeum
equitans Kin4-M has the smallest Z score of 0.72 and is also the smallest genome. Ninety-eight
prokaryotic genomes have Z scores larger than 3.0, indicating that 72% of the genomes have significant
usage bias of successive synonymous codons. A histogram of the general Z scores is shown in Fig. 2; as
can be seen, the distribution could not be well fitted by a single Gaussian function because it has two
peaks.

Table 2. Taxonomic distribution of 136 prokaryotic genomes analysed in this study

Domain     Phylum no.  Class no.  Order no.  Family no.  Genus no.  Species no.
Bacteria   14          19         35         55          76         107
Archaea    5           14         18         27          29         29

The phylogenetic relationship among the 136 prokaryotic genomes is shown in Supplementary Fig. S1.
This tree was constructed using the neighbor-joining method 26,27 based on 16S rRNA sequences. To help
interpret the phylogenetic tree, several attributes (e.g. taxonomy ID, organism name, general Z score,
genome size, and G + C content) of the 136 prokaryotic genomes are listed in Supplementary Table S2. For
each genome, the general Z score is marked on the right side of the name. When visualized in this
manner, it is clear that the strength of synonymous codon correlation changes with phylogeny. However,
there is no consistent pattern of Z score variation among different phylogenetic groups: some groups
have very similar scores, whereas others do not. To illustrate the trend across taxonomic levels,
Tables 3, 4 and 5 list the mean Z scores and standard deviations at three levels: phylum, family, and
genus. As can be seen, the mean standard deviation changes from 1.53 to 1.42 to 0.90 when the
classifying level changes from phylum to family to genus. Correspondingly, the mean variance
coefficient changes from 0.34 to 0.24 to 0.21. Therefore, Z scores tend to be more similar at lower
taxonomic levels.

Table 3. The mean Z scores and SD at the level of phylum

Phylum            Mean   SD     Genome no.  VC (SD/mean)
Actinobacteria    7.49   3.74   8           0.50
Chlamydiae        1.61   0.41   5           0.26
Firmicutes        4.60   1.37   21          0.30
Proteobacteria    6.70   3.31   54          0.49
Spirochaetes      3.03   0.31   4           0.10
Tenericutes       1.99   0.81   5           0.41
Crenarchaeota     3.16   0.71   6           0.23
Euryarchaeota     3.39   1.57   20          0.46
Mean              4.00   1.53   15.4        0.34

Table 4. The mean Z scores and SD at the level of family

Family              Mean    SD     Genome no.  VC (SD/mean)
Mycobacteriaceae    5.01    1.08   3           0.22
Chlamydiaceae       1.44    0.19   4           0.13
Bacillaceae         4.84    2.29   4           0.47
Lactobacillaceae    4.84    1.14   3           0.24
Streptococcaceae    4.92    1.08   5           0.22
Brucellaceae        5.33    0.14   3           0.03
Burkholderiaceae    8.78    1.76   3           0.20
Neisseriaceae       10.72   3.85   3           0.36
Enterobacteriaceae  7.43    4.29   7           0.58
Pasteurellaceae     4.32    0.68   3           0.16
Vibrionaceae        8.02    0.66   5           0.08
Xanthomonadaceae    9.76    1.56   4           0.16
Spirochaetaceae     2.98    0.37   3           0.12
Mycoplasmataceae    1.99    0.81   5           0.41
Mean                5.74    1.42   3.9         0.24

Table 5. The mean Z scores and SD at the level of genus

Genus           Mean    SD     Genome no.  VC (SD/mean)
Mycobacterium   5.01    1.08   3           0.22
Chlamydophila   1.36    0.11   3           0.08
Bacillus        4.84    2.29   4           0.47
Lactobacillus   4.84    1.14   3           0.24
Streptococcus   5.03    1.22   4           0.24
Brucella        5.33    0.14   3           0.03
Vibrio          8.08    0.74   4           0.09
Xanthomonas     10.51   0.43   3           0.04
Mycoplasma      2.07    0.91   4           0.44
Mean            5.23    0.90   3.4         0.21

VC denotes the variance coefficient (SD/mean) in Tables 3, 4 and 5.
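
The general Z score defined in Section 3.3 and the variance coefficient (VC = SD/mean) reported in Tables 3, 4 and 5 are both simple averages, and can be sketched as follows. The function names and input layout are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def general_z(z_by_amino_acid):
    """General Z score of one genome.

    Input maps each amino acid to the Z scores of its isoaccepting pairs.
    The scores are averaged within each amino acid, then across amino acids.
    """
    per_amino_acid = [mean(zs) for zs in z_by_amino_acid.values() if zs]
    return mean(per_amino_acid)


def variance_coefficient(general_z_scores):
    """VC = SD / mean for the genomes of one taxonomic group.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    return stdev(general_z_scores) / mean(general_z_scores)


# Made-up numbers, for illustration only.
genome_score = general_z({"Ala": [4.0, 6.0], "Ser": [10.0]})
group_vc = variance_coefficient([2.0, 4.0, 6.0])
```

A small VC within a group means that its member genomes have similar strengths of codon reuse, which is the pattern seen increasingly from phylum to family to genus.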

3.4. Comparative analyses of codon reuse strengths 
among different groups 
The 136 prokaryotic genomes can be classified into two groups according to each of six criteria:
Gram type, oxygen metabolism, growth rate, G + C content, genome size, and tRNA gene number. We
performed comparative analyses of the general Z scores between the two groups under each of the six
classifying criteria, and the results are listed in Table 6. For all criteria except Gram type and
oxygen metabolism, the two groups are divided equally based on median criterion values. Z scores of the
two groups are significantly different under all classifying criteria except Gram type. Among the five
criteria with significant differences, genome size and tRNA gene number are the most sensitive, as
their P-values are the smallest.



Table 6. Comparison of the general Z scores between any two groups based on six classification criteria^a

Classifying criteria      Mean   SD     Genome number  P-value
Gram type
  Gram negative           6.01   3.43   70             0.064
  Gram positive           4.87   2.64   35
Growth rate^b
  Fast                    6.28   3.05   54             0.028
  Slow                    4.92   3.27   53
Oxygen metabolism^c
  Aerobic                 6.39   3.05   35             0.017
  Anaerobic               4.58   2.04   16
G + C content
  Low GC (<46.2%)         3.57   2.19   68             1.21e-09
  High GC (>46.2%)        6.62   3.11   68
tRNA gene number
  Less tRNA (<32)         3.30   2.15   68             1.37e-13
  More tRNA (>32)         6.89   2.84   68
Genome size
  Small size (<2.55 Mb)   3.25   1.97   68             3.08e-14
  Large size (>2.55 Mb)   6.93   2.90   68

a Because information on the upper three factors is not available for some of the genomes, the total
genome number is less than 136 for these factors. Detailed information on each prokaryotic genome is
shown in Supplementary Table S2.
b Original growth rate data were obtained from Vieira-Silva and Rocha. 40 Genomes with generation time
longer than 2 h are taken as slow growing, otherwise as fast growing.
c Original data on oxygen metabolism were obtained from NCBI at
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/lproks_0.txt.



In descending order of statistical significance,
the other three criteria are G + C content, oxygen
metabolism, and growth rate. It is interesting that the
two most frequently used classifying criteria, oxygen
metabolism and Gram type, are less strongly associated with
variation in the strength of synonymous codon pair bias than are
genomic features such as G + C content, genome
size, and tRNA gene number.
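
The group comparisons summarized in Table 6 can be sketched from first principles. The code below is an illustration under assumptions: it splits genomes into two halves at the median of a criterion and computes a Welch (unequal-variance) t statistic from summary statistics. Whether the study used the Welch or the pooled-variance form of the t-test is not stated, and the function names are hypothetical. The example call uses the Gram type row of Table 6.

```python
from math import sqrt


def split_by_median(criterion, scores):
    """Split scores into lower and upper halves by the criterion value."""
    order = sorted(range(len(criterion)), key=lambda i: criterion[i])
    half = len(order) // 2
    low = [scores[i] for i in order[:half]]
    high = [scores[i] for i in order[half:]]
    return low, high


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom.

    Assumption: unequal-variance form; the paper only says 't-tests'.
    """
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    t = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Gram negative versus Gram positive, summary statistics from Table 6.
t_gram, df_gram = welch_t(6.01, 3.43, 70, 4.87, 2.64, 35)
```

The resulting t statistic and degrees of freedom can then be converted to a two-sided P-value with any statistics package.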

3.5. Linear correlation between the general Z scores 
and genomic characteristics 
The strength of synonymous codon correlation varied largely across prokaryotic genomes, as shown
above, and we have identified potential determinant factors of this variation. It would be
interesting to identify a quantitative relationship between Z scores and genomic factors. Three factors
that can be directly extracted from chromosomal DNA sequences are chromosome size, G + C
content, and tRNA gene number. After obtaining the value of each factor for all 136 prokaryotic
genomes, we fitted each factor linearly against the general Z scores using the least squares
method. 25 The correlation coefficient (R) and significance (P-value) of each fit were computed
with R. Scatter plots of the general Z scores against the three factors, with the linear
fits, are shown in Fig. 3. All three factors are significantly correlated (0.55 ≤ R ≤ 0.78,
P < 10^{-11}) with the general Z scores. Genome size has the strongest correlation with the general
Z score, according to both the R coefficient and the P-value. This is consistent
with the above comparative analysis, where genome size was shown to be the most effective
distinguishing feature. In general, a phenomenon is often associated with multiple factors, so
multiple linear regression is more realistic and effective than single-variable
fitting. To obtain a stronger correlation, multiple linear regression was therefore also performed.
Chromosome size, G + C content, and tRNA gene number constitute the three explanatory variables,
with the general Z score as the dependent variable. Using R, the regression equation was obtained
as Equation (2).

\[
Z = 0.943 \times S + 0.072 \times G + 0.064 \times T - 3.271, \qquad (2)
\]

where Z denotes the general Z score, S denotes 
chromosome size, G denotes G + C content, T 
denotes tRNA gene number, and chromosome size is 
measured in millions of base pairs (Mb). The
Pearson's coefficient (r value) of the multiple regression
is 0.8334, with P = 2.2 × 10^{-16}. Using



No. 6] F.B. Guo et al. 483 

Figure 3. Scatter plot of general Z scores against three factors (genome size, G + C content, and tRNA gene number) for 136 prokaryotic
genomes. In the figure, each point corresponds to a prokaryotic genome. (A) Scatter plot of general Z scores against genome size (Mb): linear
fitting by least squares method (R = 0.78, P < 2.2e-16). (B) Scatter plot of general Z scores against G + C content (%): linear fitting by
least squares method (R = 0.60, P = 7.8e-15). (C) Scatter plot of general Z scores against tRNA gene number: linear fitting by least squares
method (R = 0.55, P = 5.4e-12).



Equation (2), we seek to predict the strength of successive
synonymous codon pair bias in any sequenced
prokaryotic genome with some reliability. The prediction
error is only 1.72. Roughly speaking, a
Pearson's coefficient of 0.8334 corresponds to R^2 ≈ 0.69, so the three
variables together explain about 69% of the variance in the general Z score.
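
Equation (2) can be applied directly to any genome for which the three variables are known. The following sketch simply evaluates the published coefficients; the function name is hypothetical, G + C content is assumed to be expressed in percent (as on the axis of Fig. 3B), and the input values in the example are made up for illustration.

```python
def predict_general_z(size_mb, gc_percent, trna_genes):
    """Predicted general Z score from Equation (2).

    size_mb: chromosome size in Mb; gc_percent: G + C content in percent
    (assumed unit); trna_genes: number of tRNA genes.
    """
    return 0.943 * size_mb + 0.072 * gc_percent + 0.064 * trna_genes - 3.271


# Hypothetical genome: 4.6 Mb, 50.8% G + C, 86 tRNA genes.
predicted = predict_general_z(4.6, 50.8, 86)

# Share of variance explained implied by the reported correlation coefficient.
r_squared = 0.8334 ** 2
```

Because all three coefficients are positive, larger genomes with higher G + C content and more tRNA genes are predicted to show stronger codon reuse.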



4. Discussion 

Cannarozzi et al. first observed the phenomenon of successive synonymous codon pair bias in eukaryotic
genomes in 2010. 1 Fredrick and Ibba commented that this work is one of the best examples illustrating
how codon usage patterns control ribosome speed and fine-tune translation to increase protein synthesis
efficiency. 2 They called for immediate work on bacterial genomes, in which translation rates are
substantially higher than in eukaryotes. 2 Interestingly, we found that most prokaryotic genomes have
significant biases in successive synonymous codon pairs, suggesting that this pattern is universal to
the three domains of life. Significantly different distributions of synonymous codon pairs between
protein-coding genes and intergenic sequences suggest that this bias results from translation
selection. Combined with the previous work, 1 we conclude that the bias of successive synonymous codon
pairs, as a universal pattern in all living organisms, would be a translation-associated effect and
could be used to fine-tune protein synthesis.

Furthermore, the strength of the bias varies strongly 
across different genomes. Eukaryotic cells have the 
strongest bias, whereas archaeal cells have the least. 
Among bacteria, there is also a range of differences. 
Usually, large bacterial genomes and those with high 
G + C content tend to have a stronger bias. This type 
of variation reflects the diversity of living species. 



The precise reason for the varied strength may be clarified by comparing these results with single
codon bias. Generally, highly expressed genes tend to have stronger single codon bias 8,11,28-35 in
unicellular organisms. Codon bias is thought to maximize translation efficiency, including speed and/or
fidelity. 9,11,12,28,30-35 The strength of the codon bias is determined by the strength of translation
selection exerted on the genome. 13 For example, species exposed to selection for rapid growth tend to
have more strongly selected codon usage bias. 36,37 At the same time, fast-growing bacteria with short
generation times generally have more tRNA genes to increase translation speed. 13,38 In addition, tRNA
gene number is positively correlated with genome size and G + C content. 39 Because the general Z
scores of bacteria are also associated with these factors, the correlation between the bias strength of
successive codon pairs and various genomic or biochemical characters in prokaryotes may be caused by
translation selection, as is single codon bias. One lingering question is why single codon bias is
almost absent, or at least smaller, in higher eukaryotic genomes, 37 whereas the strength of successive
codon pair bias is much stronger in higher eukaryotes than in prokaryotes. Our current thought is that
this may be due to different translation mechanisms among the three domains of life.

Another noticeable point is how translation influ- 
ences synonymous codon pair usage or why codon 
reuse favours translation efficiency. Cannarozzi et al. 
have proposed that tRNA diffusion is slower than 
both reloading and translation. 1 When the next 
amino acid that is incorporated is the same, a recently 
used tRNA would be more likely than any of its isoac- 
ceptors to still be in the vicinity of the ribosome. 1 
Therefore, reuse of isoaccepting codons would 
accelerate the translation process. 1 Direct observation
that tRNA diffusion is slower than reloading and translation
would be the strongest evidence for this
hypothesis.

Successive synonymous codon pairs reflect the order of protein-coding sequences rather than their
composition, which distinguishes them from single codon use. 2,4 Therefore, the observed bias of
successive synonymous codon pairs, as a newly observed phenomenon, illustrates that regulatory
information for protein translation is retained not only in nucleotide composition but also in
nucleotide order. The widely observed association of single codon, 3,5,6,8,9,11-13 codon pair 14-22 or
successive synonymous codon pair 1,2 use with translation efficiency suggests that translation
efficiency exerts influence on all these types of codon usage. On the other hand, genome size, G + C
content and tRNA gene number are all significantly associated with the bias strength of synonymous
codon pairs, illustrating that translation selection exerts influence on different genomic features.
Taking all factors into consideration, we conclude that translation selection is exerted on genome
sequences at multiple levels and by various mechanisms. In other words, protein translation is a
complex process and is associated with various factors such as usage of single codons, codon pairs (in
particular successive synonymous codon pairs), genome size, genomic G + C content, and tRNA gene number.